Skip to content

feat: Retry middleware + Finagle RetryBudget (0.4.0 slice 1)#22

Merged
lesnik512 merged 27 commits into
mainfrom
feat/v0.4-retry-and-budget
Jun 5, 2026
Merged

feat: Retry middleware + Finagle RetryBudget (0.4.0 slice 1)#22
lesnik512 merged 27 commits into
mainfrom
feat/v0.4-retry-and-budget

Conversation

@lesnik512

@lesnik512 lesnik512 commented Jun 5, 2026

Copy link
Copy Markdown
Member

Summary

Ships the first ship-unit of Epic 3 (Resilience): a Retry middleware backed by a Finagle-style RetryBudget token bucket, plus a NetworkError(TransportError) refinement so retry can distinguish transient failures from non-retryable ones.

  • Retry middleware (from httpware import Retry) — retries on transient network errors (NetworkError), timeouts (TimeoutError), and retryable status codes (408/429/502/503/504) for idempotent methods (GET/HEAD/OPTIONS/PUT/DELETE) by default. Exponential backoff with full jitter, honors Retry-After (seconds + HTTP-date forms, capped at max_delay), optional attempt_timeout= wall-clock cap per attempt, PEP-678 note on exhaustion.
  • RetryBudget token bucket (from httpware import RetryBudget) — Finagle/AWS-SDK defaults (ttl=10s, percent_can_retry=0.2, min_retries_per_sec=10). Default is fresh per-Retry-instance; explicit RetryBudget can be shared across multiple clients.
  • NetworkError(TransportError) — refines AsyncClient._terminal mapping so httpx2.ConnectError/ReadError/WriteError/CloseError raise NetworkError; InvalidURL/CookieConflict keep raising bare TransportError. Backwards-compat (catches of TransportError still work).
  • RetryBudgetExhaustedError(ClientError) — raised when retry was needed but the budget refused; carries last_response/last_exception/attempts.

3-1 (per-attempt timeout) dissolved into Retry.attempt_timeout= rather than shipping standalone. Stories 3-2/3-3/3-4 land here. Remaining Epic 3: 3-5 Bulkhead, 3-6 extension-slot docs.

Out of scope (deliberate, per spec): per-call retry override via extensions, Backoff protocol, retry_on_exception= configuration, streaming-body retry (Epic 4 prerequisite — tracked in planning/deferred-work.md).

Test Plan

  • just test — 183 passing, 100% line coverage
  • just lint-ci — eof-fixer, ruff format, ruff check, ty check all clean
  • Architecture invariants (per CLAUDE.md) verified: no httpx2._ private API, no from __future__ import annotations, no print(), no global logging, no # type:/# mypy: ignore
  • Optional-extras isolation verified (Retry/RetryBudget ship in core, no new optional dep)
  • Hypothesis property tests for both RetryBudget and Retry
  • RetryBudgetExhaustedError pickle round-trip tested (kwargs-only init needs __reduce__)
  • Reviewer sanity-check that NetworkError correctly routes the four transient httpx2 exception subclasses (verified via tests/test_error_mapping_terminal.py)

Refs

🤖 Generated with Claude Code

lesnik512 and others added 27 commits June 5, 2026 00:56
The original scaffold's __init__.py imported Retry and RetryBudget before
either class existed, which both broke ruff F401 and would have raised
ImportError at module load time during the intermediate tasks (Tasks 4-6
import from httpware.middleware.resilience.budget, which triggers parent
__init__.py execution).

Defer the re-exports to Task 7 once both classes are defined; update the
plan to reflect this. Pure dev-loop fix, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ailures

Refines _terminal so httpx2.NetworkError-family exceptions (ConnectError, ReadError,
WriteError, PoolTimeout) map to httpware.NetworkError. InvalidURL and CookieConflict
stay bare TransportError. Prerequisite for the Retry middleware so it can retry
transient failures without retrying typos.
PoolTimeout inherits from httpx2.TimeoutException, not NetworkError, so
the bucket is connect/read/write/close. The dispatch logic is unchanged
(PoolTimeout was already caught by the timeout branch above NetworkError),
but the docstring was misleading future maintainers and the upcoming Retry
middleware author about what NetworkError actually wraps.

Update spec + plan to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Distinct exception raised by the Retry middleware when the RetryBudget
refuses to permit a retry. Carries last_response / last_exception / attempts.
Inherits ClientError so callers catching ClientError already handle it.
… assertion

Keyword-only __init__ breaks Python's default exception pickle protocol
(unpickle calls cls(str_message) positionally). Mirror StatusError's pattern:
add _reconstruct_budget_exhausted + __reduce__. A budget-exhausted error is
the most likely error to cross process boundaries (the budget exists to
protect against load), so silently losing diagnostic context via UnpicklingError
would be bad.

Also tightens the summary-message test from "'5' in str(exc)" to an exact-string
match — the digit-in-string check would pass for many wrong messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Finagle-style: ttl=10s, min_retries_per_sec=10, percent_can_retry=0.2.
Deterministic time via injected _now callable for tests.
- Add a comment to _purge explaining the strict-< window semantics
  ([now-ttl, now] inclusive). Prevents a future reader (or Hypothesis
  test in Task 5) from expecting <= and being surprised.
- Tighten the test_deposit_after_exhaustion docstring — "floor (int
  truncation)" used the word "floor" for two different things in one
  sentence; rephrase to "int() truncates to 0".

Pure documentation; no behavior change. 145 tests still pass at 100%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t ttl margin

- test_more_deposits_never_decreases_budget: was only varying extra_deposits
  with ttl/min_rps/percent hardcoded; now parameterizes all four so the
  monotonicity claim covers the actual claimed parameter space (e.g.
  percent=0.0 case where monotonicity comes entirely from the floor).
- test_advancing_past_ttl_purges_deposits: replace ttl + 0.1 epsilon with
  ttl * 2.0. Self-evident; stays safe regardless of strategy max_value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements full_jitter_delay() per AWS's "full jitter" formulation:
sleep = uniform(0, min(max_delay, base_delay * 2**attempt_index)).
Injects a deterministic _random_uniform kwarg for testability.
Adds tests/test_backoff.py to keep 100% coverage gate green until
Task 7 adds Retry integration tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…attempt_index

base_delay * (2 ** attempt_index) raises OverflowError at attempt_index >= 1024
because 2**1024 is too large for int→float conversion. Float exponentiation
saturates to math.inf, which min() then clamps to max_delay. The current Retry
default max_attempts=3 makes this unreachable in practice, but the function
takes attempt_index unbounded, so closing the trap is a one-char fix at zero
behavior cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hypothesis writes a constants/ cache under .hypothesis/ when property tests
run, which makes `just lint` (eof-fixer .) try to "fix" cache files outside
the tracked tree. Add to root .gitignore alongside .pytest_cache / .ruff_cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers: happy path, 503-then-200, max_attempts exhaustion with PEP-678 note,
idempotency gate (POST not retried by default, opt-in via retry_methods),
non-retryable status passthrough (404 raised immediately).
Exception-based retry, attempt_timeout, Retry-After, and budget integration
follow in subsequent commits.
…__init__.py

- retry.py:106: spec extracted the assert message to its own line for ruff's
  message-extraction rule; that line wasn't covered by the trailing # pragma:
  no cover. Add the pragma on the msg line too. 100% gate green again.
- resilience/__init__.py: Task 7 plan Step 4 was to re-export Retry and
  RetryBudget; missed in the prior commit. Re-export them with __all__ so
  the public resilience package surface is complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Retry.__call__: assert last_exc is not None is stripped by python -O,
  which would convert the structural invariant into an AttributeError
  ("NoneType has no attribute add_note") in optimized runtimes. Replace
  with an if guard that raises AssertionError unconditionally.
- test_budget_exhausted_raises_retry_budget_exhausted_error: add a NOTE
  warning Task 11 not to duplicate this test (it lives in Task 7 only
  because the budget gate's RetryBudgetExhaustedError branch needs coverage
  from the Retry-loop side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Retries NetworkError and TimeoutError on idempotent methods.
Bare TransportError (e.g., InvalidURL) is NOT retried since it
escaped the NetworkError refinement in errors.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…with network test)

test_retries_on_network_error already asserts len(sleeper.calls) == 1; the
matching timeout test didn't, so a regression bypassing backoff on the
TimeoutError branch would go undetected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps each attempt in asyncio.timeout(); maps asyncio.TimeoutError to
httpware.TimeoutError. Caught timeouts count as retryable failures
subject to the idempotency + attempt-count gates.
- retry.py: add inline comment to wrapped.__cause__ = exc explaining it
  is load-bearing on the retry path (last_exc = wrapped) where no `raise
  ... from ...` clause sets the cause. Prevents future "cleanup" that
  silently breaks exhaustion chain display.
- test_retry.py: expand the pragma rationale to record how it was
  verified the assertions still execute (intentional break → test fails).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parsed delay overrides backoff and is capped at max_delay. Malformed values
fall back to backoff. respect_retry_after=False disables the override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _parse_retry_after: clamp the integer branch to max(0.0, ...) matching
  the HTTP-date branch. Negative Retry-After values from malformed servers
  silently became negative delays passed to asyncio.sleep (which treats
  them as 0). Explicit clamp removes the inconsistency.
- test_respect_retry_after_false_ignores_header: add len(sleeper.calls)==1
  assertion before indexing, matching the pattern used by test_retries_503.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies RetryBudgetExhaustedError field population (last_response on
status path, last_exception on network path), per-instance fresh default
budget, and explicit budget sharing across multiple Retry middlewares.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orkError

Completes the 0.4.0 slice 1: retry middleware + Finagle-style budget
+ NetworkError refinement for transient httpx2 failures.
Stale references to "Task 7" and "through Tasks 6-7" don't survive
the shipped branch. Reword to describe what the file actually does
(unit coverage of the pure helper; integration via Retry tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 self-assigned this Jun 5, 2026
@lesnik512

Copy link
Copy Markdown
Member Author

Code Review

Overview

Ships Epic 3 slice 1 (Resilience):

  • Retry middleware — idempotent-method retry on StatusError (408/429/502/503/504), NetworkError, TimeoutError, optional per-attempt attempt_timeout=. Full-jitter exponential backoff with Retry-After honoring (seconds + HTTP-date, capped at max_delay).
  • RetryBudget — Finagle/AWS-SDK token bucket (ttl=10s, percent=0.2, floor=10/s). Per-instance by default; explicit budget can be shared.
  • NetworkError(TransportError) — refines _terminal mapping for retryability without breaking TransportError catches.
  • RetryBudgetExhaustedError(ClientError) — pickleable via __reduce__, carries last_response/last_exception/attempts.

Correctness

Solid:

  • _terminal exception ordering is correct (src/httpware/client.py:108-119): TimeoutException → bare (InvalidURL, CookieConflict)NetworkErrorHTTPError. NetworkError precedes HTTPError (subclass before parent); InvalidURL/CookieConflict are caught before NetworkError so they stay non-retryable. New test test_httpx2_decoding_error_maps_to_transport_error pins that DecodingError (non-transient RequestError subclass) falls through to bare TransportError.
  • Status-code retry uses the existing exception flow (retry.py:108-112): catches StatusError, inspects exc.response.status_code, re-raises the original subclass on exhaustion. No special wiring — this is exactly the right protocol-seam usage.
  • add_note exhaustion path is set on the live exception before raise last_exc; PEP 678 note is preserved in traceback output.
  • asyncio.timeout wrapping (retry.py:118-124): the manually-set __cause__ is necessary because the retry branch stashes wrapped in last_exc and re-raises without a from clause. The inline comment makes this non-obvious choice readable.
  • full_jitter_delay uses 2.0 ** attempt_index to saturate to math.inf at large attempt indices instead of raising OverflowError. Subtle and well-commented.
  • _parse_retry_after clamps negative integers (retry.py:51) to 0.0; HTTP-date branch clamps clock-skew negatives via max(0.0, delta). RFC 9110 allows only integer seconds, so falling through to date parsing for \"12.5\" is spec-correct.
  • RetryBudgetExhaustedError.__reduce__ uses a module-level reconstructor (_reconstruct_budget_exhausted), which is the right approach for kwargs-only __init__ — closures/lambdas wouldn't pickle.

One minor observation, not a blocker:

  • RetryBudget.deposit() is called every attempt including the first (retry.py:101). Even non-retryable failures (NotFoundError, InvalidURL) deposit a token. That matches Finagle's "deposit per request" semantics — flagging only because it means a high-volume client hitting a stream of 404s slowly accrues budget headroom for unrelated 503 storms. The spec explicitly treats this as correct.

Project conventions

All CI-enforced invariants respected:

  • No httpx2._ private API
  • No from __future__ import annotations
  • No print() / no global logging
  • No # type: / # mypy: ignores (per-line # noqa with justifications throughout)
  • Module constants pattern: DEFAULT_RETRY_STATUS_CODES, _MAX_ATTEMPTS_INVALID live at module top
  • Tests use httpx2.MockTransport injection (no respx, no RecordedTransport)
  • import http.HTTPStatus used over invented constants
  • Retry ships in core (no optional extra) — appropriate; pure-stdlib dependencies
  • Public re-exports added to httpware/__init__.py in alphabetized __all__

Test coverage

  • just test 183 passing, 100% line coverage
  • Property tests: test_budget_props.py (3 properties: bounded retries, TTL expiry purges, more deposits never decreases budget); test_retry_props.py (4 properties: attempts ≤ max_attempts, total sleep ≤ max_attempts × max_delay, non-retryable status/method = 1 attempt). Appropriate for the concurrency-sensitive code paths CLAUDE.md calls out.
  • _SleepRecorder + injected _sleep= keeps the suite instant — no freezegun, no real waits.
  • Pickle round-trip for RetryBudgetExhaustedError verifies __reduce__ works.
  • The test_attempt_timeout_fires_and_retries # pragma: no cover rationale (coverage[thread] losing the coroutine frame after asyncio.timeout cancellation) is documented inline. Verifiable by intentionally breaking the assert — acceptable.

Gaps worth noting (not blockers):

  • No direct unit test for _parse_retry_after(\"-5\") clamp — covered transitively by retry tests but a focused assertion would document the behavior more clearly.
  • Pickle test uses RuntimeError(\"boom\") for last_exception; doesn't probe an unpickleable inner exception (e.g., one referencing a live socket). Edge case, downstream concern.

Suggestions for follow-up

These are improvements, not change requests for this PR:

  1. Consider extracting _compute_delay from Retry.__call__ into a module-private function — would make the Retry-After-vs-backoff branch unit-testable directly and shrink the # noqa: C901, PLR0912 complexity budget.
  2. Logging at retry decision points is deferred to Epic 5 (Observability 5-2), so Retry is currently silent on retry decisions. Reasonable; flag for the Epic 5 work to make sure budget-refusal and exhaustion events plumb through.
  3. Streaming-body retry correctly deferred to planning/deferred-work.mdRetry will need to refuse retries on streamed bodies when AsyncClient.stream lands (4-3).

Risks

  • Behavioral change for users catching TransportError: NetworkError is a subclass, so existing catches still work. Verified by the diffed test test_httpx2_connect_error_maps_to_network_error (was TransportError, now NetworkError). Backwards-compat preserved.
  • Default RetryBudget is fresh per Retry instance: explicit and documented in PR body. Users instantiating multiple Retry() middlewares across clients won't get cross-client budget protection unless they pass a shared RetryBudget(). Worth a callout in the public docs site (Epic 6).

Verdict

Ship it. Careful, well-tested, conventions-respecting work. The Finagle defaults are battle-tested numbers and the _now/_sleep/_random_uniform DI design keeps the test surface tight without freezegun.


🤖 Generated with Claude Code

@lesnik512 lesnik512 merged commit 2774640 into main Jun 5, 2026
5 checks passed
@lesnik512 lesnik512 deleted the feat/v0.4-retry-and-budget branch June 5, 2026 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant